
[DRAFT] add harm categories to AdvBench Dataset #732

Draft · wants to merge 2 commits into main

Conversation

paulinek13 (Contributor)

Description

This PR aims to resolve #730 by adding a way to manually assign harm categories to the AdvBench dataset and by enabling filtering based on those categories.

Marked as a draft PR since I'm seeking confirmation on the approach.

Tests and Documentation

paulinek13 (Contributor, Author) commented on Mar 4, 2025

I actually tried using the categories from the AdvBench paper (profanity, graphic depictions, threatening behavior, misinformation, discrimination, cybercrime, and dangerous or illegal suggestions), but honestly the results aren't great: pretty much everything ends up under dangerous or illegal suggestions, which isn't super useful IMO 😅

```json
{
    "0": ["cybercrime", "dangerous or illegal suggestions"],
    "1": ["cybercrime", "dangerous or illegal suggestions"],
    "2": ["dangerous or illegal suggestions"],
    "3": ["dangerous or illegal suggestions"],
    "4": ["cybercrime", "dangerous or illegal suggestions"],
    "5": ["dangerous or illegal suggestions"],
    "6": ["discrimination", "dangerous or illegal suggestions"],
    "7": ["dangerous or illegal suggestions"],
    ...
}
```

So I then tested the Collaborative, Human-Centered Taxonomy of AI, Algorithmic, and Automation Harms instead. It breaks harms down into main categories (e.g. Physical, Financial, Psychological) and subcategories (e.g. Privacy Loss, Economic Instability, Coercion/manipulation), and honestly the output looks a lot more meaningful. I think it captures nuance better instead of lumping everything into one or two types.

For now, I have only categorized the first 50 prompts, with the help of the Claude 3.7 model, and put the results in pyrit/datasets/harm_categories/adv_bench_dataset.json.

Well... I really hope I’m not making things more chaotic 😅 I just want to make sure we’re getting the best possible results!
Let me know what you think :)

Successfully merging this pull request may close: BUG harm categories for AdvBench Dataset aren't added yet